ABSTRACT

Hidden links are designed solely for search engines rather
than visitors. To obtain high search engine rankings, link
hiding techniques are commonly used to promote black-market
industries, such as illicit game servers, false medical
services, illegal gambling, and other disreputable but
high-profit businesses. This paper investigates hyperlink
hiding techniques on the Web and gives a detailed taxonomy.
We believe the taxonomy can help in developing appropriate
countermeasures. A study of the home pages of 5,583,451
Chinese sites indicates that link hiding techniques are very
prevalent on the Web. We also explore the attitude of Google
towards link hiding spam by analyzing the PageRank values of
the relevant links. The results show that more should be
done to punish hidden link spam.

Categories and Subject Descriptors 

I.2.6 [Artificial Intelligence]: Learning; H.3.3 [Information
Storage and Retrieval]: Information Search and Retrieval

General Terms 

Measurement, Experimentation, Algorithms 

Keywords 

Web Spam, Link Hiding, Hidden Spam, Spam Detection 

1. INTRODUCTION 

Most Web surfers depend on search engines to locate
information on the Web. Link analysis algorithms [12], such
as PageRank [13] and HITS [10], are commonly used for search
engine ranking. Link analysis algorithms assume that every
link represents a vote of support, in the sense that if
there is a link from page x to page y and these two pages
are authored by different people, then the author of page
x is recommending page y. In particular, PageRank is the
basis of Google's search technology [4].

Web spammers try to mislead search engines in order to
obtain a high rank in search results [8]. In this context,
hyperlink hiding techniques are often used to deceive search
engines. Spammers hope that many small endorsements from
pages with hidden links result in a sizable PageRank for the
target page. Several questions naturally arise: what link
hiding techniques are the spammers using; how prevalent
are hidden spam links on the Web; and are the hidden pages
punished by the search engines? This paper attempts to
answer those questions.

The rest of the paper is organized as follows. Section 2
presents a literature review. Section 3 gives a comprehensive
taxonomy of current hidden link spam techniques. Section
4 describes the experimental analysis on 5,583,451 Chinese
Web sites. Finally, Section 5 draws the conclusion.

2. RELATED WORK 

Hidden links, which are invisible to visitors, are designed
to increase link popularity [15]. Google considers hyperlinks
hidden by small characters to be deception [7]. Z. Gyöngyi
et al. point out that hidden links are often used in honey
pots to boost the ranking of spam pages [9]. They further
present a comprehensive taxonomy of current spamming
techniques and survey content hiding techniques, where
spam links hidden by avoiding anchor text or by using tiny
anchor images are mentioned [8].

To the best of our knowledge, there is no previously pub- 
lished literature that directly studied how prevalent, suc- 
cessful, or varied hidden link spam techniques are on the 
Web. This paper attempts to study hidden link spam in 
detail. It is hoped that the findings can help in developing 
appropriate countermeasures. 

3. HYPERLINK HIDING TECHNIQUES 

There are many different ways to hide links from visitors
while leaving them perfectly viewable to search engines. In
this section, we examine current hyperlink hiding techniques
used by spammers and attempt to categorize them based on
their features. As in the work on JavaScript redirection
spam [5], we present short examples to show the hiding
techniques actually used by spammers. Simple techniques are
presented first, followed by more advanced ones.

3.1 A: Making Anchor Text Font Color the
Same as Background Color

The simplest and oldest method that spammers use to 
create hidden links is to make the font of anchor text the 
same color as the background. Here is one example. 

<span style="background:white;">
<a href="target.html" style="color:white">anchor text</a>
</span>

In this example, the color scheme is defined in the HTML
document. Color schemes can also be defined in an attached
cascading style sheet (CSS) file. Sometimes, spammers also
use background images. They set the image color to be the
same as the font color, which is relatively harder to
detect.

3.2 B: Making Anchor Text Font Color Al- 
most Match Background Color or Back- 
ground Image 

Instead of setting the font color to entirely match the
background color, some spammers and web masters set their
font colors to almost match the background color. They
believe that slightly changing the color of the text will
thwart the search engines' automated detection systems.

<div style="background-color:white;">
<a href="target.html" style="color:#FEFFEE">
anchor text with similar color with background
</a>
</div>
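As an illustration of how techniques A and B might be checked together, the following minimal sketch compares a font color with a background color and flags the pair when the two are identical or nearly so. The function names, the small color table, and the cutoff MAX_DISTANCE are hypothetical choices made for this sketch; they are not part of the study, and a practical detector would also have to resolve style sheets and background images.

```javascript
// Hypothetical color table; a real detector would cover all CSS color names.
const NAMED_COLORS = new Map([
  ["white", [255, 255, 255]],
  ["black", [0, 0, 0]],
]);

// Parse "#RRGGBB", "#RGB", or a known color name into [r, g, b]; null otherwise.
function parseColor(value) {
  const v = value.trim().toLowerCase();
  if (NAMED_COLORS.has(v)) return NAMED_COLORS.get(v);
  const six = /^#([0-9a-f]{6})$/.exec(v);
  if (six) {
    const n = parseInt(six[1], 16);
    return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
  }
  const three = /^#([0-9a-f]{3})$/.exec(v);
  if (three) return [...three[1]].map((c) => parseInt(c + c, 16));
  return null;
}

// Euclidean distance between two colors in RGB space (0 means identical).
function colorDistance(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

// Assumed cutoff below which text is treated as blending into the background.
const MAX_DISTANCE = 30;

function isCamouflaged(fontColor, backgroundColor) {
  const fg = parseColor(fontColor);
  const bg = parseColor(backgroundColor);
  if (fg === null || bg === null) return false; // unknown notation: do not flag
  return colorDistance(fg, bg) <= MAX_DISTANCE;
}
```

With this cutoff, both the exact match of Section 3.1 (white on white) and the near match of Section 3.2 (#FEFFEE on white) are flagged, while black on white is not.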

3.3 C: Setting Tiny Anchor Text or Placing 
the hyperlinks in a Tiny Block 

Making the anchor text tiny is another hyperlink hiding
method. The hyperlink can be made very small, for example
1 pixel high or even 0 pixels. Here is a simple example.

<a href="target.html" style="font-size:0px">keyword</a>

In HTML, the div element is often used for generic
organizational or stylistic purposes. Spammers can also use
div to set the link size. The following is another example.

<div style="font-size:0px;">
<a href="target.html">invisible anchor text</a>
</div>

Perhaps the most common use of the div element is to carry
class or id attributes in conjunction with CSS to apply
layout, typographic, color, and other presentation attributes
to parts of the content. In the previous example,
font-size:0px can also be defined in a CSS file. Besides, the
div block size can be set via the width and height
properties. For example, in
<div style="width:1px;height:1px;">, the div size is
1 pixel.

Another example, which hides a hyperlink in a tiny scrolling
block, is presented below.

<MARQUEE scrollAmount=1 width=1 height=1>
<a href="target.html">keywords</a>
</MARQUEE>

In this example, target.html is put in a scrolling block 
with area 1x1 pixel, which is invisible to Web users. 
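A check for the inline-style variants of technique C could look roughly like the sketch below. It only inspects a style attribute string for font-size, width, or height values at or below an assumed cutoff (MAX_TINY_PX); the helper names are hypothetical, and sizes defined in external CSS files or in units other than px are outside its scope.

```javascript
// Parse an inline style string such as "font-size:0px; width:1px" into a Map.
function parseInlineStyle(style) {
  const props = new Map();
  for (const declaration of style.split(";")) {
    const colon = declaration.indexOf(":");
    if (colon === -1) continue;
    const name = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim().toLowerCase();
    props.set(name, value);
  }
  return props;
}

// Return the number in a value like "1px", or null if the value is not in px.
function pixels(value) {
  const m = /^(-?\d+(?:\.\d+)?)px$/.exec(value || "");
  return m ? parseFloat(m[1]) : null;
}

// Assumed cutoff: anything this small is treated as invisible to a visitor.
const MAX_TINY_PX = 1;

function isTiny(style) {
  const props = parseInlineStyle(style);
  return ["font-size", "width", "height"].some((name) => {
    const px = pixels(props.get(name));
    return px !== null && px <= MAX_TINY_PX;
  });
}
```

For example, isTiny("font-size:0px") and isTiny("width:1px;height:1px;") are both true, whereas an ordinary style such as "font-size:12px; width:300px" is not flagged.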



3.4 D: Disguising Anchor Text as Plain Text 

Sometimes, spammers insert hyperlinks into a paragraph, 
where the anchor text looks like plain text. Here's a para- 
graph of text on a site: 

The SEO company follows strict rules to
insure the clients website reach the top of
search engines and stay there.

A user would not see any hyperlinks, even after mousing
over every word in the paragraph. But a user who happened
to click on just the right word would be whisked away to an
SEO site. In fact, there is a hidden link under the anchor
text "SEO company". The source of the page shows the
following:

The <a href="http://www.seomarketleaders.com"
onMouseOver="window.status='';return true;"
style="cursor:text;color:black;text-decoration:none;">
SEO company</a> follows strict rules to insure the clients
website reach the top of search engines and stay there.

Using a similar method, a link can be hidden in a small
character, for example a hyphen in the middle of a
paragraph.

3.5 E: Placing Hyperlinks in High-Speed Scrolling 
Blocks 

The <marquee> tag is a non-standard HTML element
which causes text to scroll up, down, left or right
automatically [14]. Although the W3C advises against its use
in HTML documents, it is still widely used. The
SCROLLAMOUNT attribute sets the speed of the scrolling: a
bigger value for SCROLLAMOUNT makes the marquee scroll
faster. If the SCROLLAMOUNT value is big enough, the
scrolling block will be invisible to the naked eye. Here is a
simple example.

<marquee height=1 width=8 scrollamount=3000>
<a href="target.html">keywords</a>
</marquee>

The default SCROLLAMOUNT value is 6. The value in the
example is 3000, which is too fast to see.

Similar effects can also be achieved through the use of 
JavaScript or HTML <blink> element [14] [1]. 
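The marquee-based variants (the 1x1 scrolling block of Section 3.3 and the high-speed block described here) could be screened with a sketch like the following. It extracts the attributes of each marquee tag with regular expressions, which is an approximation rather than real HTML parsing, and the threshold FAST_FACTOR is an assumption of this sketch; only the default SCROLLAMOUNT of 6 comes from the text.

```javascript
const DEFAULT_SCROLLAMOUNT = 6; // default value stated in the text
const FAST_FACTOR = 50; // assumed: flag speeds at least 50 times the default

// Collect the attributes of every <marquee ...> start tag in the HTML text.
function marqueeAttributes(html) {
  const results = [];
  const tagPattern = /<marquee\b([^>]*)>/gi;
  let tag;
  while ((tag = tagPattern.exec(html)) !== null) {
    const attrs = Object.create(null);
    const attrPattern = /([a-z]+)\s*=\s*["']?([^\s"'>]+)/gi;
    let attr;
    while ((attr = attrPattern.exec(tag[1])) !== null) {
      attrs[attr[1].toLowerCase()] = attr[2];
    }
    results.push(attrs);
  }
  return results;
}

// A marquee is suspicious if it scrolls extremely fast or is at most 1x1.
function hasSuspiciousMarquee(html) {
  return marqueeAttributes(html).some((attrs) => {
    const speed = Number(attrs.scrollamount ?? DEFAULT_SCROLLAMOUNT);
    const width = Number(attrs.width ?? Infinity);
    const height = Number(attrs.height ?? Infinity);
    const tooFast = speed >= DEFAULT_SCROLLAMOUNT * FAST_FACTOR;
    const tooSmall = width <= 1 && height <= 1;
    return tooFast || tooSmall;
  });
}
```

Both marquee examples given in this section are flagged by this check, while a marquee that keeps the default speed and a visible size is not.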

3.6 F: Putting Links outside the Screen 

Using cascading style sheets, any division can be positioned
absolutely or relatively. With absolute positioning, the
text to be hidden can simply be placed any number of pixels
off the screen to the left of the window. Here is some
example code:

.hiddenclass {
position: absolute;
left: -977px;
}

If this rule is put in the style sheet and the class
"hiddenclass" is assigned to a div, the div will be
displayed 977 pixels to the left of the visible screen,
i.e., it will not appear on the screen. Here is an example:

<div class="hiddenclass">
<a href="target.html">keywords</a>
</div>

The absolute position can also be set in the div directly
as follows:

<div style="left: -977px; position: absolute; top: -977px">
<a href="target.html">keywords</a>
</div>

In the example above, left: -977px may be written in more
complex forms, such as left: expression(23 - 1000).
In addition to the methods described above, spammers can
use the CSS text-indent property or margin-left property to
put hyperlinks outside the screen. An example is presented
below.

<div style="text-indent: -9999px;">
<a href="target.html">keywords</a>
</div>
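The off-screen placements of technique F can be recognized, at least in their inline-style form, by looking for large negative offsets. The sketch below is one possible check under stated assumptions: OFFSCREEN_PX is an arbitrary cutoff chosen for illustration, the helper names are hypothetical, and values written as expression(...) or defined in external style sheets are not evaluated.

```javascript
// Parse an inline style string into a Map of lower-cased property names and values.
function parseInlineStyle(style) {
  const props = new Map();
  for (const declaration of style.split(";")) {
    const colon = declaration.indexOf(":");
    if (colon === -1) continue;
    const name = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim().toLowerCase();
    props.set(name, value);
  }
  return props;
}

// Return the number in a value like "-977px", or null if the value is not in px.
function pixels(value) {
  const m = /^(-?\d+(?:\.\d+)?)px$/.exec(value || "");
  return m ? parseFloat(m[1]) : null;
}

// Assumed cutoff: offsets this far negative push content out of the window.
const OFFSCREEN_PX = 500;

function isOffScreen(style) {
  const props = parseInlineStyle(style);
  const farNegative = (name) => {
    const px = pixels(props.get(name));
    return px !== null && px <= -OFFSCREEN_PX;
  };
  // left and top only move an element when it is positioned.
  const positioned = ["absolute", "relative", "fixed"].includes(props.get("position"));
  if (positioned && (farNegative("left") || farNegative("top"))) return true;
  return farNegative("text-indent") || farNegative("margin-left");
}
```

The check treats left and top as meaningful only for positioned elements, since those properties have no effect on statically positioned content.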

3.7 G: Using Visibility:Hidden or Display:None
Style Commands

An alternative to the method above is to simply use the
built-in features of style sheets to hide hyperlinks:

.hiddenclass {
visibility: hidden;
}

Again, if this rule is put into a style sheet and the class
"hiddenclass" is assigned to a div, the hyperlinks in the div
block will not appear in the browser window. Replacing
visibility: hidden with display: none has the same hiding
effect.
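To connect a style sheet rule of this kind to the elements it hides, a detector first needs the names of the classes whose rules make content invisible. The sketch below extracts those class names from CSS text with regular expressions. It is a simplification that handles only plain ".class { ... }" rules, and the function name hidingClasses is hypothetical.

```javascript
// Return the names of classes whose rule sets visibility:hidden or display:none.
function hidingClasses(css) {
  const hidden = [];
  // Matches simple rules of the form ".name { declarations }".
  const rulePattern = /\.([A-Za-z_][\w-]*)\s*\{([^}]*)\}/g;
  // Matches a whole declaration, so "display: block" is not mistaken for a hit.
  const hides = /(?:^|;)\s*(?:visibility\s*:\s*hidden|display\s*:\s*none)\s*(?:;|$)/i;
  for (const rule of css.matchAll(rulePattern)) {
    if (hides.test(rule[2])) hidden.push(rule[1]);
  }
  return hidden;
}
```

Applied to the rule shown in this section, the function returns the single class name "hiddenclass"; any div carrying that class could then be examined for hyperlinks.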

3.8 H: Hiding Hyperlinks via JavaScript 

JavaScript is a programming language commonly implemented
as part of a web browser in order to create enhanced user
interfaces and dynamic websites [6]. Google claims that
search engines have difficulty accessing JavaScript [7]. In
2011, labnol.org reported that Google indexes JavaScript-based
Facebook comments, but there is no clear report that Google
parses JavaScript code on the whole Web. This encourages
spammers to hide hyperlinks with the aid of JavaScript. Here
is a simple example:

<script language="JavaScript" type="text/javascript">
document.write("<div style='visibility:hidden'>");
</script>
<a href="target.html">keywords</a>
<script language="JavaScript" type="text/javascript">
document.write("</div>");
</script>

The example is easy to understand: it is a repackaging of
the method described in Section 3.7. In the code above, the
<div> and </div> tags are embedded separately in JavaScript
code, and so may not be indexed by search engines. However,
the hyperlink target.html appears in plain HTML, which is
more likely to be indexed by search engines.

In a similar manner, almost all the link hiding techniques
described in this section can be further disguised with
JavaScript. Next, let's look at a more complex example.

<div id="ql1000">
<a href="target.html" title="keyword">keyword</a>
</div>
<script language="JavaScript">
var _xa=["\x64\x69\x73\x70\x6C\x61\x79","\x6E\x6F\x6E\x65",
"\x71\x6c\x31\x30\x30\x30","\x73\x74\x79\x6C\x65",
"\x67\x65\x74\x45\x6C\x65\x6D\x65\x6E\x74\x42\x79\x49\x64"];
document[_xa[4]](_xa[2])[_xa[3]][_xa[0]]=_xa[1];
</script>

The above JavaScript code is deliberately obscure. The
elements of the array _xa are ASCII characters written as
hexadecimal escape sequences. The last line of the code is
equivalent to

document['getElementById']('ql1000')['style']['display']='none',

which makes all the content, including hyperlinks, in the
div with id ql1000 invisible. To avoid presenting the whole
style assignment directly, a script can also build up the
style assignment via string concatenation. A very
straightforward example is presented below.

<script type="text/javascript">
document.getElementById("q" + "l" + "1000").style.display
="n" + "o" + "ne";
</script>

What is worse, JavaScript, as a programming language, has
many functions and operators that can throw off a human
reader. The following code shows the flexibility of
JavaScript.

<script language="javascript">function HexTostring(s){
var r='';
for(var i=0;i<s.length;i+=2){
var sxx=parseInt(s.substring(i,i+2),16);
r+=String.fromCharCode(sxx);}
return r;}
eval(HexTostring("646f63756d656e742e676574456c656d65"+
"6e74427949642822716c3130303022292e7374796c652e6469"+
"73706c6179203d20226e6f6e6522"));
</script>

These codes are essentially equivalent to the previous ex- 
ample, yet look completely different. 
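The two encodings used in the examples of this section, \xNN escape sequences and strings of hexadecimal digit pairs, can both be reversed statically before a script is inspected. The following sketch does so and then applies a deliberately crude keyword test. The function names are hypothetical, the keyword test is only a heuristic assumed for illustration, and string concatenation such as "n" + "o" + "ne" is not resolved.

```javascript
// Replace \xNN escape sequences in script source text with their characters.
function decodeHexEscapes(source) {
  return source.replace(/\\x([0-9a-fA-F]{2})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// Decode a string of hexadecimal digit pairs, e.g. "6e6f6e65" gives "none".
function decodeHexString(hex) {
  let out = "";
  for (let i = 0; i + 1 < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16));
  }
  return out;
}

// Heuristic: after decoding, does the script mention a hiding style assignment?
function looksLikeHidingScript(source) {
  const decoded = decodeHexEscapes(source);
  const candidates = [decoded];
  // Also decode quoted literals that consist only of hex digits.
  for (const literal of decoded.matchAll(/["']([0-9a-fA-F]{8,})["']/g)) {
    if (literal[1].length % 2 === 0) candidates.push(decodeHexString(literal[1]));
  }
  return candidates.some(
    (text) => /display|visibility/i.test(text) && /none|hidden/i.test(text)
  );
}
```

A script flagged this way would still need further analysis, since legitimate pages also toggle the display property; the sketch only shows that the obfuscation layers themselves are cheap to undo.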



3.9 I: Hiding Hyperlinks via Cloaking or Redi- 
rection Techniques 

Cloaking is a Web spam technique in which the page
presented to the search engine spider is different from that
presented to the user's browser [16]. Some spammers hide
target hyperlinks using cloaking techniques. Similarly,
spammers also use redirection techniques to hide target
hyperlinks. Among the redirection spam techniques,
JavaScript-based redirection is the most notorious and the
most difficult to catch [5]. Wu et al. [16] and Chellapilla
et al. [5] have conducted comprehensive studies of cloaking
and redirection techniques respectively, so the techniques
are not repeated here. However, it is important to point out
that we consider the hyperlinks in the redirecting page, not
the redirection target URL, to be hidden links. For example,
suppose page A redirects to B, and C is a hyperlink in page
A. In this paper, C is a hidden link, but B is not.
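For the cloaking case, one simple way to surface hidden links is to compare the hyperlinks in the copy of a page served to a crawler with those in the copy served to a browser. The sketch below assumes that both copies have already been fetched by some other means (for instance with different User-Agent headers); the fetching step, the function names, and the regular-expression link extraction are all assumptions of this illustration rather than the method of the cited studies.

```javascript
// Collect the href values of all <a> tags in an HTML string.
function extractHrefs(html) {
  const hrefs = new Set();
  const pattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  for (const m of html.matchAll(pattern)) {
    hrefs.add(m[1] ?? m[2] ?? m[3]);
  }
  return hrefs;
}

// Links that appear in the crawler's copy but not in the browser's copy.
function crawlerOnlyLinks(crawlerHtml, browserHtml) {
  const seenByUser = extractHrefs(browserHtml);
  return [...extractHrefs(crawlerHtml)].filter((href) => !seenByUser.has(href));
}
```

Any link returned by crawlerOnlyLinks is visible to the search engine but never shown to a visitor, which matches the notion of a hidden link used in this paper.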

3.10 J: Hiding Hyperlinks in Pull-Down Menu 

A pull-down menu, also called a drop-down menu, is a menu
of commands or options that appears when an item is selected
with a mouse. A drop-down menu makes it easier to display a
large list of choices: only one choice is displayed
initially, and the remaining choices are displayed when the
user activates the menu. Some spammers insert the target
hyperlinks into a long pull-down list, where they are hard
to find.

3.11 K: Inserting Links into Long Title or Meta 
Tags 

Generally, web browsers show only the beginning of a long
title. Thus, some spammers insert URLs into long titles.
Similarly, meta tags provide structured metadata about a Web
page and are used by search engines. Although they have been
the targets of spammers for a long time and search engines
rely on these data less and less, there are pages still
using them.

3.12 L: Hiding Div "Below" the Visible Layer 

Another sneaky way to hide a hyperlink from Web users
while keeping it available to the search engines is to put
the hyperlink in a layer that is "behind" the visible layer.
The CSS z-index property, which is supported in all major
browsers, specifies the stack order of an element. An
element with a greater stack order is always in front of an
element with a lower stack order. One example of hiding
hyperlinks via z-index is presented below.

<div id="front" style="position:absolute; z-index:1">
<img src="image.gif">
</div>
<div id="back" style="position:absolute; z-index:-1">
<a href="target.html" target="_blank">keywords</a>
</div>

The code shows that the second div has a negative stack
order, which places target.html behind image.gif.

Besides z-index, "overflow:hidden" can also hide
hyperlinks below the visible layer. Here is a simple example.



<style type="text/css">
#spam{width:99px;height:20px;overflow:hidden;position:absolute;}
#spam a{display:block;line-height:20px;text-decoration:none;}
</style>
<div id="spam">
<a href="/">&nbsp;</a>
<a href="target.html" title="keywords">keywords</a>
</div>



In the example above, target.html is covered by a non- 
breaking space. 

4. PREVALENCE OF LINK HIDING TECH- 
NIQUES 

In this section, we study the prevalence of hidden spam
links and of each of the techniques described in Section 3.
We further study whether the hidden pages are punished by
Google via PageRank analysis.

We carried out the analysis on 5,583,451 Chinese home
pages (http://www. + domain name), including .com, .net
and .cn domain names. To detect the Web pages with hidden
links, we first trained a cost-sensitive naive Bayes
classifier on 63 pages with hidden links and 181 normal
pages. The cost-sensitive model ensures a high recall of
pages with hidden links. Then, we filtered the 5,583,451
pages with the trained model. The detection results contain
quite a few false alarms, but they are sufficient for
analyzing the prevalence of hidden links. By random sampling
from the suspicious set and carrying out manual
verification, we approximately determined the number of
pages with hidden links. Table 1 tabulates the statistics in
detail.



Table 1: Percentage occurrence of hidden link spam
among Chinese Web pages

URL Type        Count / Total                     Percentage
gov.cn          994/41405 = 1/42                  2.4%
.com/.net/.cn   81765/(5583451 - 41405) = 1/68    1.48%
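The sampling step behind counts of this kind can be written down as a short calculation: the fraction of confirmed pages in a manually verified random sample is scaled up to the whole suspicious set. The sketch below is one way to do this under stated assumptions. The function name, the normal-approximation interval, and the numbers in the usage note are illustrative only; the sizes of the suspicious set and of the verified sample are not given in the text.

```javascript
// Estimate how many pages in the suspicious set truly contain hidden links,
// from a random sample of that set that was verified by hand.
function estimateHiddenLinkPages(suspiciousCount, sampleSize, confirmedInSample) {
  if (!Number.isInteger(sampleSize) || sampleSize <= 0) {
    throw new RangeError("sampleSize must be a positive integer");
  }
  if (confirmedInSample < 0 || confirmedInSample > sampleSize) {
    throw new RangeError("confirmedInSample must lie between 0 and sampleSize");
  }
  // Fraction of sampled suspicious pages that really contain hidden links.
  const precision = confirmedInSample / sampleSize;
  // 95% interval for a proportion, using the normal approximation.
  const standardError = Math.sqrt((precision * (1 - precision)) / sampleSize);
  const margin = 1.96 * standardError;
  return {
    precision,
    estimate: Math.round(suspiciousCount * precision),
    low: Math.round(suspiciousCount * Math.max(0, precision - margin)),
    high: Math.round(suspiciousCount * Math.min(1, precision + margin)),
  };
}
```

As a purely hypothetical usage example, if 50 of 200 sampled pages from a suspicious set of 1000 were confirmed, the function would report a precision of 0.25 and an estimate of 250 pages, together with an approximate interval around that figure.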



A number of Chinese pages use hyperlink hiding techniques.
In comparison with ordinary pages, gov.cn pages are more
likely to contain hidden links. The underlying reason is
that government sites usually have better credibility.
Spammers expect that recommendations from these sites will
help to boost the ranking of target sites. Is this really
the case? We try to answer this question by analyzing the
PageRank values of the hidden links.

To analyze the prevalence of the variety of techniques
described in Section 3, we randomly sampled 160 pages with
hidden links from the gov.cn set and 118 pages with hidden
links from the .com/.net/.cn set. Each sampled hidden link
spam page was manually analyzed. All 278 samples were
labeled with the types of techniques they used. In addition,
all the hidden links were extracted for further analysis. In
total, 9864 target hyperlinks are hidden in the 278 pages¹.

¹ The data is freely available at
http://gengguanggang.wix.com/gggeng#!hidden-link-spam/cqzl.

Table 2 describes the prevalence of hidden link techniques
in detail.

Table 2: Prevalence of different link hiding techniques

Technique   Number of pages (percentage)   Number of hidden links
A           6 (2.2%)                       39
B           3 (1.1%)                       29
C           8 (2.9%)                       33
D           19 (6.8%)                      21
E           8 (2.9%)                       117
F           68 (24.5%)                     4333
G           30 (10.8%)                     1876
H           111 (39.9%)                    3210
I           5 (1.8%)                       122
J           9 (3.2%)                       31
K           15 (5.4%)                      6
L           3 (1.1%)                       47
All         285 (285/278 = 102.5%)         9864

The table shows that the 278 Web pages contain 9864
hidden links. F, G and H are the most popular link hiding
techniques, which together account for 75.2% of the sampled
pages (209 of 278). These three techniques can easily be
used to hide multiple hyperlinks. Because the technique
counts sum to 285, some of the 278 Web pages use more than
one link hiding technique.

Are the 278 source pages and the 9864 target pages
punished by the search engines? We do not know the detailed
ranking strategies of commercial search engines, but we can
explore this question indirectly by analyzing the PageRank
values of the source and target hyperlinks.

Google provides a public interface, toolbarqueries.google.com,
for querying PageRank values. The 278 source URLs and the
overwhelming majority of the target hyperlinks have no
hierarchical path, i.e., they are of the form
http:// + hostname. We therefore compared the PageRank
values of the hostnames in our analysis. A hostname is a
domain name assigned to a host computer, which is usually a
combination of the host's local name with its parent
domain's name. For example,
hostname(http://en.wikipedia.org/wiki/Hostname) = en.wikipedia.org.
In the rest of the section, the PageRank of a hyperlink or
URL refers to the PageRank of its corresponding hostname.

Table 3 shows the average PageRank values of the source
links, the target hidden links, and 29994 hostnames randomly
selected from DNS resolution logs. The average PageRank
value of the source URLs is 2.216, which means that spammers
usually select reputable pages in which to hide target
hyperlinks. The 9864 hidden links have an average PageRank
value of 1.32, which is higher than that of the randomly
selected hostnames. This result suggests that Google has not
established an effective punitive mechanism for hidden
links. There is certainly another possibility: Google has
punished these hidden links, which would have had even
higher PageRank values had they not cheated.

Table 3: Comparison of average PageRank values

                   Source URLs   Hidden links   Randomly selected hosts
Number             278           9864           29994
Average PageRank   2.216         1.320          1.139
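The hostname-level comparison used in this section can be sketched as follows. The sketch maps each URL to its hostname and averages the PageRank of those hostnames. It assumes that the PageRank values have already been collected into a lookup table (the Map pageRankOfHost is hypothetical, and no query to the PageRank interface is made here), and it skips URLs whose hostname has no recorded value, which is a choice of this sketch rather than something stated in the text.

```javascript
// hostname(http://en.wikipedia.org/wiki/Hostname) = en.wikipedia.org
function hostnameOf(url) {
  return new URL(url).hostname;
}

// Average the PageRank of the hostnames behind a list of URLs.
// pageRankOfHost: Map from hostname to a previously collected PageRank value.
function averagePageRankByHost(urls, pageRankOfHost) {
  let sum = 0;
  let counted = 0;
  for (const url of urls) {
    const rank = pageRankOfHost.get(hostnameOf(url));
    if (rank === undefined) continue; // assumption: unknown hosts are skipped
    sum += rank;
    counted += 1;
  }
  return counted === 0 ? null : sum / counted;
}
```

Running the same function over the source URLs, the hidden links, and a random set of hostnames gives the three averages that are compared in Table 3.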

We further analyzed whether the links hidden in government
sites have higher PageRank values. Our study does not
support this hypothesis: the average PageRank value of links
hidden in gov.cn sites is 1.319, whereas that of links
hidden in other sites is 1.342.

Figure 1 describes the distribution of PageRank values of 
278 source links. 




Figure 1: Distribution of PageRank values of 278
source links (axis label: PageRank Value).

It can be observed that spammers tend to insert the hid- 
den hyperlinks into pages with high PageRank values. And 
in this way, spammers hope to boost the PageRank values 
of the target links. 

Figure 2 describes the distribution of PageRank values of
the 9864 hidden links. From Figure 2, we can see that 53.9%
of the hidden hyperlinks have PageRank values. However,
quite a number of spam links have high PageRank values:
Figure 2 shows that more than 5.4% of the hidden links have
PageRank values greater than or equal to 5.

Figure 2: Distribution of PageRank values of 9864
hidden links.



Next, we analyzed the average PageRank values of target
hyperlinks hidden with different techniques, which are
shown in detail in Figure 3.

Figure 3: The average PageRank values of target
links hidden with different techniques.

Figure 3 shows that the hyperlinks hidden via techniques
H, I or J have greater PageRank values. That links hidden
with techniques H and I have high PageRank values is easy
to understand if Google ignores the JavaScript code.
However, why hyperlinks hidden in pull-down menus have high
PageRank values is puzzling. Perhaps the spam links are
scattered through the drop-down list, which makes them
relatively hard for search engines to recognize.

Finally, we analyze the high-frequency words in the an- 
chor texts of the 9864 target hyperlinks. The top 10 high- 
frequency keywords and the corresponding types are tabu- 
lated as follows. 



Keywords      Term Frequency   Type
[illegible]   328              Gambling sites
[illegible]   258              Gambling sites
[illegible]   239              Illicit game servers
[illegible]   221              Gambling sites
[illegible]   172              Online novels
[illegible]   167              Medical services
[illegible]   104              Gambling sites
[illegible]   95               Gambling sites
[illegible]   73               Medical services
[illegible]   65               Navigation sites

Figure 4: The high-frequency words in the anchor
texts of the target hyperlinks.

The statistics show that gambling sites, personal game 
servers and medical services are the main types of the hidden 
links. Most of the sites belong to shady or illegal industries. 

5. CONCLUSION AND FUTURE WORK 

In this paper we presented a variety of commonly used link
hiding techniques and organized them into a taxonomy. We
analyzed the prevalence of common link hiding techniques
on the Web and discussed, via PageRank analysis, whether
search engines punish hidden links. As in previous work on
Web spam [8] [5], we argue that such a structured
discussion of the subject is important for raising the
awareness of the research community. Given that most of the
sites using link hiding techniques belong to shady or
illegal industries, more should be done to punish hidden
link spam.

In the future, we should pay more attention to two things.
The first is studying hidden link spam on a bigger data set
that includes multilingual samples. The second is developing
a proper countermeasure that addresses the problem as a
whole, despite the variety of link hiding techniques. One
possible solution draws support from maturing optical
character recognition (OCR) techniques [11]. The motivation
is that OCR, as a computer vision technique, can only read
the content that is visible on the Web page, as humans do.
A snapshot of a Web page can easily be taken with software
such as wkhtmltopdf [3] and snapshotter [2]. All the visible
text on the snapshot image can be recognized via OCR
techniques as a textVector. If an anchor text does not
exist in the textVector, the corresponding hyperlink is
identified as a hidden link. Of course, the relative
position of the anchor text should also be taken into
account.
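The comparison step of this proposed countermeasure can be sketched in a few lines, assuming that the snapshot and OCR stages have already produced the textVector as an array of recognized strings. The sketch below covers only the final matching step; the input shapes, the function names, and the whitespace and case normalization are assumptions made for illustration, and the position check mentioned above is not implemented.

```javascript
// Lower-case the text and collapse runs of whitespace, so that minor OCR
// spacing differences do not prevent a match.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// anchors: array of { href, text } taken from the page source.
// textVector: array of strings recognized by OCR on the page snapshot.
// Returns the hrefs whose anchor text does not occur in the visible text.
function findHiddenLinks(anchors, textVector) {
  const visible = normalize(textVector.join(" "));
  return anchors
    .filter((anchor) => !visible.includes(normalize(anchor.text)))
    .map((anchor) => anchor.href);
}
```

An anchor with empty text is never flagged by this sketch, because the empty string occurs in any text; links carried by images would need separate handling.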